Domain-Independent Structured Duplicate Detection

Authors

Rong Zhou

Eric A. Hansen

Abstract

The scalability of graph-search algorithms can be greatly extended by using external memory, such as disk, to store generated nodes. We consider structured duplicate detection, an approach to external-memory graph search that limits the number of slow disk I/O operations needed to access search nodes stored on disk by using an abstract representation of the graph to localize memory references. For graphs with sufficient locality, structured duplicate detection outperforms other approaches to external-memory graph search. We develop an automatic method for creating an abstract representation that reveals the local structure of a graph. We then integrate this approach into a domain-independent STRIPS planner and show that it dramatically improves scalability for a wide range of planning problems. The success of this approach strongly suggests that similar local structure can be found in many other graph-search problems.
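
The idea of using an abstraction to localize memory references can be illustrated with a toy sketch. It is not the authors' planner-based method: the state space, the `project` abstraction, and the function names are all hypothetical, and the scopes are computed by brute-force enumeration rather than derived automatically from operators. Each abstract node owns a bucket of stored states, and expanding states from one bucket only touches the buckets in that abstract node's duplicate-detection scope.

```python
from collections import defaultdict
from itertools import product

def project(state, kept):
    """Abstraction (assumed): keep only the variables at the indices in `kept`."""
    return tuple(state[i] for i in kept)

def successors(state):
    """Toy transitions (assumed): increment any one coordinate modulo 3."""
    for i in range(len(state)):
        yield state[:i] + ((state[i] + 1) % 3,) + state[i + 1:]

def duplicate_detection_scopes(states, kept):
    """For each abstract node, collect the abstract nodes its successors can land in."""
    scopes = defaultdict(set)
    for s in states:
        a = project(s, kept)
        for t in successors(s):
            scopes[a].add(project(t, kept))
    return scopes

def bfs_sdd(start, kept, scopes):
    """Breadth-first search that checks duplicates only within the current scope."""
    buckets = defaultdict(set)  # abstract node -> stored states (stand-in for disk files)
    buckets[project(start, kept)].add(start)
    frontier = {start}
    while frontier:
        # Group the frontier by abstract node so each group is expanded together.
        by_abs = defaultdict(list)
        for s in frontier:
            by_abs[project(s, kept)].append(s)
        next_frontier = set()
        for a, group in by_abs.items():
            # Only the buckets in the scope of `a` need to be resident in memory.
            loaded = {b: buckets[b] for b in scopes[a]}
            for s in group:
                for t in successors(s):
                    bucket = loaded[project(t, kept)]
                    if t not in bucket:  # duplicate check stays inside the scope
                        bucket.add(t)
                        next_frontier.add(t)
        frontier = next_frontier
    return buckets

all_states = list(product(range(3), repeat=3))
scopes = duplicate_detection_scopes(all_states, kept=(0,))
buckets = bfs_sdd((0, 0, 0), (0,), scopes)
print(sorted(scopes[(0,)]))
print(sum(len(b) for b in buckets.values()))
```

In this toy domain each scope holds two of the three abstract nodes, so a third of the stored states could stay on disk during any expansion step. The saving grows when the abstraction exposes more locality.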

Similar resources

Parallel Structured Duplicate Detection

We describe a novel approach to parallelizing graph search using structured duplicate detection. Structured duplicate detection was originally developed as an approach to external-memory graph search that reduces the number of expensive disk I/O operations needed to check stored nodes for duplicates, by using an abstraction of the search graph to localize memory references. In this paper, we sho...

A Domain-Independent Data Cleaning Algorithm for Detecting Similar-Duplicates

Data mining algorithms generally assume that data will be clean and consistent. However, in practice, this is not always the case, and for this reason the detection and elimination of duplicate records is an important part of data cleaning. The presence of similar-duplicate records causes over-representation of data. If the database contains different representations of the same data, the resul...

Fuzzy Duplicate Detection on XML Data

XML is popular for data exchange and data publishing on the Web, but it comes with errors and inconsistencies inherent to real-world data. Hence, there is a need for XML data cleansing, which requires solutions for fuzzy duplicate detection in XML. The hierarchical and semi-structured nature of XML strongly differs from the flat and structured relational model, which has received the main atten...

Detection of Duplicate Objects in Semi Structured Data like XML

Duplicate detection is the process of finding duplicate objects in data. It is an important part of the data-cleansing step of data mining. Duplication occurs when a real-world object has multiple representations in a data source. A significant amount of work has been done on duplicate detection in relational data, but only recently have researchers shifted their focus towards duplica...

Unsupervised duplicate detection using sample non-duplicates

The problem of identifying objects in databases that refer to the same real-world entity is known, among other names, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, due to errors and missing data. Traditional scenarios for duplicate detection are data warehouses, which are populated from several data sources. Duplicate detection here is part ...

Publication date: 2006
